Importing Libraries

library("plotly")
## Warning: package 'plotly' was built under R version 3.2.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("ggplot2")
library("e1071")
library("nnet")
library("arules")
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     %in%, abbreviate, write

Reading the Data files

movie <- read.csv("C:/Users/Supreeth/Downloads/movie1-7550.csv")
ratings <- read.csv("C:/Users/Supreeth/Downloads/ratings0-140k.csv")

Removing Duplicate values

#Removing all duplicate rows, keyed on the combination of title and year released
movies_noduplicate <- movie[!duplicated(movie[c("title", "year_released")]),]

#Similarly, removing all duplicates in the ratings data set
ratings_noduplicate <- ratings[!duplicated(ratings[c("title", "year_released")]),]

#Merging both the data frames on a condition of title and year released
movies_complete <- merge(movies_noduplicate,ratings_noduplicate,by=c("title","year_released"))
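
The de-duplicate-then-merge step can be checked on a small made-up example. This is only a sketch: the toy data frames and their values are assumptions, not rows from the real files. It shows that `duplicated()` keeps the first row per key and that `merge()` performs an inner join by default, so titles present in only one file are dropped.

```r
# Toy data standing in for the movie and ratings files (values are made up).
movie_toy <- data.frame(
  title = c("A", "A", "B", "C"),
  year_released = c(2000, 2000, 2001, 2002),
  Country = c("USA", "USA", "UK", "India"),
  stringsAsFactors = FALSE
)
ratings_toy <- data.frame(
  title = c("A", "B", "B", "D"),
  year_released = c(2000, 2001, 2001, 2003),
  rating = c(7.1, 6.4, 6.4, 5.0),
  stringsAsFactors = FALSE
)

key <- c("title", "year_released")

# duplicated() flags the second and later occurrences of each key,
# so the first row for every (title, year) pair is the one that survives.
movie_unique <- movie_toy[!duplicated(movie_toy[key]), ]
ratings_unique <- ratings_toy[!duplicated(ratings_toy[key]), ]

# merge() defaults to an inner join: "C" and "D" have no partner and are dropped.
merged_toy <- merge(movie_unique, ratings_unique, by = key)
print(merged_toy)

# all.x = TRUE keeps unmatched movies, with NA in the rating column.
left_toy <- merge(movie_unique, ratings_unique, by = key, all.x = TRUE)
print(left_toy)
```

If unrated movies should stay in the analysis, the `all.x = TRUE` variant may be the better fit; the inner join is what the code above uses.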

Converting year variable into Character

Removing NA Values

#dimensions of the data frame
str(movies_complete)
## 'data.frame':    135540 obs. of  39 variables:
##  $ title            : Factor w/ 127357 levels "'71","'76","'79 Parts",..: 1 2 3 4 5 6 8 9 7 10 ...
##  $ year_released    : int  2014 2016 2016 2016 1993 1999 2004 2005 2016 2005 ...
##  $ Actors_Main      : Factor w/ 118160 levels "'Bee One Boyefio Borketey,Massiel Checo,Gary Davis",..: 96579 19424 32193 76702 100124 1793 62091 59107 106018 89795 ...
##  $ Actors_Mainid    : Factor w/ 118229 levels ",,","nm0000002,nm0000063,nm0001922",..: 82581 63878 3337 22378 45066 61161 71476 23672 111606 29296 ...
##  $ Also_Known_As    : Factor w/ 55041 levels "","'63 Cehennemi",..: 1957 1 1965 1 2041 53134 15349 1 45024 15482 ...
##  $ Aspect_Ratio     : Factor w/ 150 levels "","0.095833333",..: 90 1 49 1 49 1 49 1 1 1 ...
##  $ Budget           : Factor w/ 5347 levels "","$ 1","$ 1,00,00,000",..: 1 541 1 764 1 1 1 1 1 1 ...
##  $ Color            : Factor w/ 20 levels ""," Black and White",..: 10 16 10 10 10 10 10 10 10 10 ...
##  $ Country          : Factor w/ 5038 levels "","Afghanistan",..: 4187 3093 4544 4544 4066 3213 2651 2651 2651 2269 ...
##  $ Directors        : Factor w/ 71107 levels "","'Atlas' Ramachandran",..: 69686 26949 5398 59318 52276 6548 70163 35102 59729 68234 ...
##  $ Filming_locations: Factor w/ 15836 levels "","'Nab' school for blinds, India",..: 13982 1 3298 4458 1 1 8378 1 1 10850 ...
##  $ Gallery          : Factor w/ 8 levels "","http://ia.media-imdb.com/images/G/01/imdb/images/nopicture/small/unknown-1394846836._CB527141414_.png",..: 4 6 5 1 1 1 2 1 2 1 ...
##  $ Genres           : Factor w/ 3327 levels ""," Action"," Action, Adventure",..: 681 2795 461 3274 1 2 3023 3023 3023 2423 ...
##  $ Gross            : Factor w/ 9163 levels "","$1,000,000,(USA)",..: 243 1 1 1 1 1 1 1 1 1 ...
##  $ KeyWords         : Factor w/ 48832 levels "",".hack","007,terrorist cell,intelligence agency,computer cracker,official james bond series",..: 5323 33686 1 28668 10469 9784 8467 29533 1 29581 ...
##  $ Language         : Factor w/ 4253 levels "","Abkhazian",..: 563 1036 1063 563 3040 1628 2756 2756 2756 2385 ...
##  $ Opening_Weekend  : Factor w/ 8514 levels "","$1,000,(USA),(17 July 2015)",..: 5687 1 1 1 1 1 1 1 1 1 ...
##  $ Oscars           : Factor w/ 36 levels "","Nominated for 1 BAFTA Film Award.",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ Other_awards     : Factor w/ 836 levels "","Another 1 nomination.",..: 66 1 1 1 1 1 1 1 1 1 ...
##  $ Release          : Factor w/ 51887 levels "","1 April 1990 (Czechoslovakia)",..: 3937 30140 50689 2110 1 21550 34429 19347 20472 37157 ...
##  $ Runtime_min      : Factor w/ 293 levels "","1 min","1,151 min",..: 293 27 284 284 288 1 1 1 1 15 ...
##  $ Sound_Mix        : Factor w/ 537 levels "","12-Track Digital Sound",..: 139 1 1 1 1 263 508 1 1 139 ...
##  $ Writers          : Factor w/ 121298 levels "'Cowboy' Matt Hopewell,Mellisa Worster",..: 39455 94473 71389 50668 117138 38665 120189 120187 120188 69342 ...
##  $ Writersid        : Factor w/ 121382 levels ",","nm0000005,nm0256890",..: 85219 65837 38638 1979 69433 32201 69789 69790 69788 70653 ...
##  $ alsolikedmovies  : Factor w/ 53722 levels "","'76","'76,93 Days,The Arbitration,Taxi Driver: Oko Ashewo,Green White Green,Just Not Married,The Wedding Party,Kati Kati,The Wedding "| __truncated__,..: 1 33115 1 1 1 1 17516 1 1 37292 ...
##  $ certification    : Factor w/ 26 levels "","(BANNED)",..: 18 1 17 1 1 1 1 1 1 1 ...
##  $ description      : Factor w/ 86775 levels "","' Death is as fleeting as the vibrancy of life itself.'",..: 45081 72372 49716 1 20722 1 38283 1 1 36269 ...
##  $ directorid       : Factor w/ 71331 levels "","nm0000019",..: 29222 40730 31994 38668 10599 28061 35945 36646 31714 36227 ...
##  $ likedmovieids    : Factor w/ 53819 levels "","t0000001,t0000002,t0000003,t0000004,t0000099,t0009999",..: 1 53233 1 1 1 1 27818 1 1 22090 ...
##  $ metaScore        : int  83 NA NA NA NA NA NA NA NA NA ...
##  $ movie_duration   : Factor w/ 300 levels "","100h 21min",..: 54 75 45 45 49 1 34 1 29 63 ...
##  $ ratingCount      : Factor w/ 11759 levels "","1,00,033",..: 7128 1371 5734 1 11296 9347 8251 1371 1 11758 ...
##  $ rating_critics   : Factor w/ 532 levels "","1 critic",..: 142 1 1 1 1 1 3 1 1 80 ...
##  $ rating_users     : Factor w/ 993 levels "","0 user","1 critic",..: 138 1 3 1 4 1 827 1 1 192 ...
##  $ release_date     : Factor w/ 50756 levels "","1 April 1990 (Czechoslovakia)",..: 3870 29459 49550 2072 1 21061 33635 18905 20006 36296 ...
##  $ storyline        : Factor w/ 86796 levels "","' Feedback' : This piece looks at how the digital age has set a precedence for the way in which we judge others, and not always"| __truncated__,..: 16255 45634 47034 1 16243 1 55208 1 1 19579 ...
##  $ titlecast        : Factor w/ 126815 levels "","'Bee One Boyefio Borketey,Massiel Checo,Gary Davis,Divendre Hernandez,Gene Jenney,Dave Jia,Aeja Pinto,Shawna Tran",..: 49059 99608 35618 54037 21760 126701 67505 73379 114082 96763 ...
##  $ titlecastids     : Factor w/ 126844 levels "","nm0000002,nm0000063,nm0001922,nm0444321,nm0005331,nm0898696,nm0494157,nm0547462,nm0015804,nm0056350,nm0063009,nm0120205,nm01078"| __truncated__,..: 79442 74489 3886 9639 98173 60884 78506 79467 119756 34662 ...
##  $ rating           : num  7.2 7.5 6.8 NA 5.1 3.2 5.1 5.1 NA 6.8 ...
dim(movies_complete)
## [1] 135540     39
colnames(movies_complete)
##  [1] "title"             "year_released"     "Actors_Main"      
##  [4] "Actors_Mainid"     "Also_Known_As"     "Aspect_Ratio"     
##  [7] "Budget"            "Color"             "Country"          
## [10] "Directors"         "Filming_locations" "Gallery"          
## [13] "Genres"            "Gross"             "KeyWords"         
## [16] "Language"          "Opening_Weekend"   "Oscars"           
## [19] "Other_awards"      "Release"           "Runtime_min"      
## [22] "Sound_Mix"         "Writers"           "Writersid"        
## [25] "alsolikedmovies"   "certification"     "description"      
## [28] "directorid"        "likedmovieids"     "metaScore"        
## [31] "movie_duration"    "ratingCount"       "rating_critics"   
## [34] "rating_users"      "release_date"      "storyline"        
## [37] "titlecast"         "titlecastids"      "rating"
### What is the total number of movies reviewed by year?
#Converting year_released into character
movies_complete$year_released <- as.character(movies_complete$year_released)
temp <- movies_complete %>% select(title,year_released)
temp <- temp %>% group_by(year_released) %>% summarise(n=n())

temp <- na.omit(temp)
p <- plot_ly(temp, x = ~year_released, y = ~n, name = "Number of Movies by Year")
p
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar
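
As a cross-check on the per-year counts, the same tally can be produced in base R. The sketch below uses a made-up vector of years; `table()` drops `NA` by default, which plays the role of the `na.omit()` call above.

```r
# Made-up release years, including one missing value.
years <- c("2014", "2016", "2016", "2016", "1993", NA)

# table() counts each distinct year and ignores NA by default.
counts <- as.data.frame(table(year_released = years), stringsAsFactors = FALSE)
names(counts)[2] <- "n"
print(counts)
```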

Average IMDB Rating per year

The average score is broadly stable from year to year. The lowest average rating occurs in 1995.

temp1 <- movies_complete %>% select(rating,year_released)
temp1 <- na.omit(temp1)
temp1 <- temp1 %>% group_by(year_released)%>% summarise(score=mean(rating)) 


which(temp$year_released == 1970)
## [1] 1
which(temp$year_released == 1989)
## [1] 2
which(temp$year_released == 2017)
## [1] 30
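# Drop these years by value: the positions above were looked up in temp, which can differ from temp1 after na.omit()
temp1 <- temp1[!(temp1$year_released %in% c("1970", "1989", "2017")),]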

temp1=as.data.frame(temp1)

p <- plot_ly(temp1, x = ~year_released, y = ~score, name = 'Avg Imdb Rating per year', type = 'scatter', mode = 'lines')
p
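
A minimal base R sketch of the per-year average, on made-up ratings, may help show why excluding years by value is safer than excluding them by row position. The data and the `excluded_years` name are assumptions for illustration.

```r
# Made-up ratings; one rating is missing.
ratings_toy <- data.frame(
  year_released = c("1970", "1989", "1995", "1995", "2016", "2016", "2017"),
  rating = c(6.0, 5.5, 4.0, 5.0, 7.0, NA, 8.0),
  stringsAsFactors = FALSE
)

# Drop rows with a missing rating before averaging.
rated <- ratings_toy[!is.na(ratings_toy$rating), ]

# Mean rating per year.
avg_by_year <- aggregate(rating ~ year_released, data = rated, FUN = mean)
names(avg_by_year)[2] <- "score"

# Exclude years by value so the result does not depend on row order.
excluded_years <- c("1970", "1989", "2017")
avg_by_year <- avg_by_year[!(avg_by_year$year_released %in% excluded_years), ]
print(avg_by_year)
```

Row positions in a grouped summary shift whenever a group disappears (for example, a year whose ratings are all missing), so a lookup done on one table does not necessarily carry over to another.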

Are content ratings tied to imdb_score?

#Avg score for each type of content_rating
temp2 <- movies_complete %>% select(certification,rating)
temp2 <- na.omit(temp2)

temp2 <- temp2 %>% group_by(certification)%>% summarise(score = mean(rating))
p <- plot_ly(temp2,
             x = ~certification,
             y = ~score,
             name = "Avg score by Rating",
             type = "bar", color = I("red"))
p

How do the scores vary by category?

#The highest average score appears to belong to the TV-Y category.

temp4 <- movies_complete %>% select(rating,certification)
temp4 <- na.omit(temp4)
temp4=as.data.frame(temp4)

p <- plot_ly(temp4, x = ~rating, color = ~certification, type = "box")
p
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
# In the box plot, the interquartile range for X-rated movies is very narrow. The PG-13 category has the largest number of outliers, with ratings from 1.4 to 3.4.

Analyzing Directors with more than 10 movies

# Avg IMDb rating for the top 10 directors who have directed more than 10 movies (the filter below uses no_rows > 10)
# Splitting the list of directors and plotting using the first director only

str(movies_complete$Directors)
##  Factor w/ 71107 levels "","'Atlas' Ramachandran",..: 69686 26949 5398 59318 52276 6548 70163 35102 59729 68234 ...
movies_complete$Directors <- as.character(movies_complete$Directors)
# Keep only the first listed director; entries with no director become NA
movies_complete$Directors <- vapply(strsplit(movies_complete$Directors, ","), function(parts) parts[1], character(1))

#Average IMDb rating per director, restricted to directors with more than 10 movies

avg_rating_director <- movies_complete %>% group_by(Directors) %>% mutate(.,no_rows = length(Directors)) %>% select(Directors, rating, no_rows) %>% filter(., no_rows > 10) %>% summarise_each(funs(mean(., na.rm=TRUE)))

avg_rating_director <- avg_rating_director %>% arrange(desc(rating))
avg_rating_director <- avg_rating_director[1:10,]

p <- plot_ly(avg_rating_director, x = ~Directors , y = ~ rating, color = ~Directors)
p
## No trace type specified:
##   Based on info supplied, a 'bar' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#bar
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
#Among the top 10 directors, "Jeff Tudor" and "larry Rossen" have the highest IMDb ratings.
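
The director ranking can be sketched in base R on made-up data. Everything here is illustrative: the names, ratings, and the `min_movies` threshold are assumptions. The point is that the movie count and the mean rating are computed per director, the threshold is applied to the count, and only then are the directors sorted.

```r
# Made-up films; one rating is missing.
films <- data.frame(
  Directors = c("Ann", "Ann", "Ann", "Bob", "Bob", "Cy"),
  rating = c(8, 7, NA, 6, 5, 9),
  stringsAsFactors = FALSE
)
min_movies <- 2  # hypothetical threshold for this toy example

# Number of rows per director (rows with a missing rating are still counted).
n_movies <- table(films$Directors)

# Mean rating per director, ignoring missing ratings.
mean_rating <- tapply(films$rating, films$Directors, mean, na.rm = TRUE)

director_summary <- data.frame(
  Directors = names(mean_rating),
  rating = as.numeric(mean_rating),
  no_rows = as.integer(n_movies[names(mean_rating)]),
  stringsAsFactors = FALSE
)

# Apply the threshold, then rank by mean rating and keep at most 10 rows.
director_summary <- director_summary[director_summary$no_rows >= min_movies, ]
director_summary <- director_summary[order(-director_summary$rating), ]
top_directors <- head(director_summary, 10)
print(top_directors)
```

Whether the comparison is `>=` or `>` decides if directors with exactly the threshold number of movies are included, so it is worth stating the choice explicitly.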

FILTERING DATA

movies = read.csv("C:/Users/Supreeth/Downloads/movies_complete.csv",stringsAsFactors = F, header = T)
str(movies)
## 'data.frame':    135540 obs. of  26 variables:
##  $ movie_title    : chr  "'71" "'76" "'79 Parts" "'85: The Greatest Team in Pro Football History" ...
##  $ title_year     : int  2014 2016 2016 2016 1993 1999 2004 2005 2016 2005 ...
##  $ Actors_Main1   : chr  "Sam Reid" "Chidi Mokeme" "Eric Roberts" "Mike Ditka" ...
##  $ Actors_Main2   : chr  "Sean Harris" "" "Sandra Bernhard" "" ...
##  $ Actors_Main3   : chr  "" "" "Kathrine Narducci" "" ...
##  $ Actors_Mainid1 : chr  "nm2730580" "nm1421120" "nm0000616" "nm0228491" ...
##  $ Actors_Mainid2 : chr  "nm0365317" "" "nm0000928" "" ...
##  $ Actors_Mainid3 : chr  "" "" "nm0621393" "" ...
##  $ Aspect_Ratio   : chr  "2.35 : 1" "" "1.85 : 1" "" ...
##  $ Budget         : chr  "" "$ 30,00,000" "" "$ 6,50,000" ...
##  $ Country        : chr  "UK" "Nigeria" "USA" "USA" ...
##  $ Director1      : chr  "Yann Demange" "Izu Ojukwu" "Ari Taub" "Scott Prestin" ...
##  $ Director2      : chr  "" "" "" "" ...
##  $ Genres         : chr  " Action, Drama, Thriller, War" " Drama, Romance" " Action, Comedy, Romance" " Sport" ...
##  $ Gross          : chr  "$1,268,760,(USA),(1 May 2015)" "" "" "" ...
##  $ Language       : chr  "English" "English,Ibo" "English,Italian" "English" ...
##  $ Opening_Weekend: chr  "$55,761,(USA),(27 February 2015)" "" "" "" ...
##  $ Runtime_min    : chr  "99 min" "118 min" "90 min" "90 min" ...
##  $ alsolikedmovies: chr  "" "Okafor's Law,93 Days,Taxi Driver: Oko Ashewo,Vaya,The Arbitration,Green White Green,Fonko,White Colour Black,Just Not Married" "" "" ...
##  $ certification  : chr  "R" "" "PG-13" "" ...
##  $ directorid     : chr  "nm1312919" "nm2339266" "nm1500658" "nm2115038" ...
##  $ likedmovieids  : chr  "" "t5955112,t5305246,t5112438,t4996954,t5811182,t5978754,t5724542,t4932754,t5978766,t5643398,t5974454,t4936122" "" "" ...
##  $ ratingCount    : chr  "37,011" "10" "29" "" ...
##  $ rating_critics : chr  "225 critic" "" "" "" ...
##  $ rating_users   : chr  "105 user" "" "1 critic" "" ...
##  $ rating         : num  7.2 7.5 6.8 NA 5.1 3.2 5.1 5.1 NA 6.8 ...
movies_1 =  filter(movies,Budget!= "")
movies_1 =  filter(movies_1,Actors_Main1!= "")
movies_1 =  filter(movies_1,Actors_Main2!= "")
movies_1 =  filter(movies_1,Actors_Main3!= "")
movies_1 =  filter(movies_1,directorid!= "")
movies_1 =  filter(movies_1,Runtime_min!= "")
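
The chain of filters above keeps only rows where every listed column is non-empty. A compact base R sketch of the same idea, on made-up data with a hypothetical `required` vector of column names, could look like this:

```r
# Made-up rows; empty strings mark missing values, as in the CSV above.
movies_toy <- data.frame(
  Budget = c("$ 10,000", "", "$ 5,000"),
  Actors_Main1 = c("A", "B", ""),
  Runtime_min = c("99 min", "90 min", "80 min"),
  stringsAsFactors = FALSE
)
required <- c("Budget", "Actors_Main1", "Runtime_min")

# TRUE where a value is present; NA is treated as missing as well.
is_filled <- function(x) !is.na(x) & x != ""

# Combine the per-column checks so a row is kept only if all columns are filled.
keep <- Reduce(`&`, lapply(movies_toy[required], is_filled))
movies_filtered <- movies_toy[keep, , drop = FALSE]
print(movies_filtered)
```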

CLEANING DATA

mymovies = movies_1
 
Budget = mymovies$Budget
library(stringi)
currency =vector()
Budget_d = vector()
currency = stri_extract(mymovies$Budget, regex='[^0-9]*')

Budget_new = vector()
for (i in seq_along(Budget)){
  # Skip past the currency prefix and keep the numeric part of the string
  length1 = nchar(currency[i])
  Budget_new[i] = substr(Budget[i],length1+1,100)
  
}

Budget_new = as.numeric(gsub(",", "", Budget_new))
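
The budget parsing can also be written without a loop. The sketch below uses only base R regular expressions on made-up strings and assumes, as the data above suggests, that budgets contain no decimal point (stripping every non-digit would otherwise turn "1.5" into 15).

```r
# Made-up budget strings in the same style as the Budget column.
budget_raw <- c("$ 30,00,000", "$ 6,50,000", "EUR 1,000", "")

# Everything before the first digit is taken to be the currency marker.
currency <- trimws(sub("^([^0-9]*).*$", "\\1", budget_raw))

# Strip every non-digit character, then convert; empty strings become NA.
budget_num <- as.numeric(gsub("[^0-9]", "", budget_raw))

print(data.frame(budget_raw, currency, budget_num))
```

Note that the amounts are in different currencies, so the numeric column is not directly comparable across rows unless the currency marker is taken into account.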

library(gdata)
## gdata: Unable to locate valid perl interpreter
## gdata: 
## gdata: read.xls() will be unable to read Excel XLS and XLSX files
## gdata: unless the 'perl=' argument is used to specify the location
## gdata: of a valid perl intrpreter.
## gdata: 
## gdata: (To avoid display of this message in the future, please
## gdata: ensure perl is installed and available on the executable
## gdata: search path.)
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLX' (Excel 97-2004) files.
## 
## gdata: Unable to load perl libaries needed by read.xls()
## gdata: to support 'XLSX' (Excel 2007+) files.
## 
## gdata: Run the function 'installXLSXsupport()'
## gdata: to automatically download and install the perl
## gdata: libaries needed to support Excel XLS and XLSX formats.
## 
## Attaching package: 'gdata'
## The following objects are masked from 'package:dplyr':
## 
##     combine, first, last
## The following object is masked from 'package:stats':
## 
##     nobs
## The following object is masked from 'package:utils':
## 
##     object.size
currency = trim(currency)
Gross = mymovies$Gross
Gross_new =vector()

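for (i in seq_along(Gross)){
  if (nchar(Gross[i]) > 1) {
    # Keep the text before the first "U" (as in "(USA)"), e.g. "$1,268,760,("
    abc = strsplit(Gross[i], "U")[[1]][1]
    # Drop the leading "$" and the trailing ",("
    abc = substr(abc, 2, nchar(abc)-2)
    Gross_new[i] = abc
  }else{
    Gross_new[i] = ""
  }
}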

Gross_new = as.numeric(gsub(",", "", Gross_new))
mymovies["currency"] = NA
mymovies["Budget_new"] = NA
mymovies["Gross_new"] = NA
mymovies$currency = currency
mymovies$Budget_new = Budget_new
mymovies$Gross_new = Gross_new
head(Gross_new)
## [1]       NA       NA 32391374       NA       NA       NA
head(Budget_new)
## [1] 1.0e+04 1.0e+06 7.5e+06 9.0e+06 1.3e+07 7.0e+06
write.csv(mymovies,"C:/Users/Supreeth/Downloads/movies_current.csv")
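
The gross parsing above depends on the letter "U" appearing in the country tag. A less fragile sketch, assuming the amount always precedes the first opening parenthesis, is shown below on made-up strings. It is an alternative illustration, not the method used in this analysis.

```r
# Made-up gross strings in the same style as the Gross column.
gross_raw <- c("$1,268,760,(USA),(1 May 2015)", "", "$32,391,374,(USA)")

# Keep the text before the first "(", which is assumed to hold the amount.
amount_text <- sub("\\(.*$", "", gross_raw)

# Remove the currency symbol, commas, and any other non-digit characters.
gross_num <- as.numeric(gsub("[^0-9]", "", amount_text))
print(gross_num)
```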

PERFORMING DATA ANALYSIS

#Linear Regression
movies_model <- read.csv("C:/Users/Supreeth/Downloads/movie_metadata.csv/movie_metadata.csv",header=TRUE)

movies_model <- na.omit(movies_model)

str(movies_model)
## 'data.frame':    3801 obs. of  28 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ director_name            : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 109 2030 1652 1228 554 2394 ...
##  $ num_critic_for_reviews   : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ actor_2_name             : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2549 1228 801 2440 653 1704 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ genres                   : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 126 120 308 126 447 126 ...
##  $ actor_1_name             : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 443 787 223 338 35 741 ...
##  $ movie_title              : Factor w/ 4917 levels "#Horror ","[Rec] 2 ",..: 398 2731 3279 3707 1961 3289 3459 399 1631 461 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 1873 46055 2036 92000 58753 24450 ...
##  $ actor_3_name             : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 2714 1970 2163 3018 2941 58 ...
##  $ facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ plot_keywords            : Factor w/ 4761 levels "","10 year old|dog|florida|girl|supermarket",..: 1320 4283 2076 3484 651 4745 29 1142 2005 1564 ...
##  $ movie_imdb_link          : Factor w/ 4919 levels "http://www.imdb.com/title/tt0006864/?ref_=fn_tt_tt_1",..: 2965 2721 4533 3756 2476 2526 2458 4546 2551 4690 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 738 1902 387 1117 973 3018 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 65 63 65 65 65 65 65 63 65 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 10 10 10 10 10 10 9 10 9 10 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
##  $ title_year               : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
##  - attr(*, "na.action")=Class 'omit'  Named int [1:1242] 5 56 85 99 100 178 200 205 207 243 ...
##   .. ..- attr(*, "names")= chr [1:1242] "5" "56" "85" "99" ...
colnames(movies_model)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"
#Considering all the numeric variables
numeric <- sapply(movies_model,is.numeric)

movies_numeric <- movies_model[,numeric]
colnames(movies_numeric)
##  [1] "num_critic_for_reviews"    "duration"                 
##  [3] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [5] "actor_1_facebook_likes"    "gross"                    
##  [7] "num_voted_users"           "cast_total_facebook_likes"
##  [9] "facenumber_in_poster"      "num_user_for_reviews"     
## [11] "budget"                    "title_year"               
## [13] "actor_2_facebook_likes"    "imdb_score"               
## [15] "aspect_ratio"              "movie_facebook_likes"
#View(movies_model)

str(movies_numeric)
## 'data.frame':    3801 obs. of  16 variables:
##  $ num_critic_for_reviews   : int  723 302 602 813 462 392 324 635 375 673 ...
##  $ duration                 : int  178 169 148 164 132 156 100 141 153 183 ...
##  $ director_facebook_likes  : int  0 563 0 22000 475 0 15 0 282 0 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 530 4000 284 19000 10000 2000 ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 640 24000 799 26000 25000 15000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 73058679 336530303 200807262 458991599 301956980 330249062 ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 212204 383056 294810 462669 321795 371639 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 1873 46055 2036 92000 58753 24450 ...
##  $ facenumber_in_poster     : int  0 0 1 0 1 0 1 4 3 0 ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 738 1902 387 1117 973 3018 ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.64e+08 ...
##  $ title_year               : int  2009 2007 2015 2012 2012 2007 2010 2015 2009 2016 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 632 11000 553 21000 11000 4000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 6.6 6.2 7.8 7.5 7.5 6.9 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 2.35 2.35 1.85 2.35 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 24000 0 29000 118000 10000 197000 ...
model <- lm(imdb_score ~ . -aspect_ratio - title_year - facenumber_in_poster, data = movies_numeric)
summary(model)
## 
## Call:
## lm(formula = imdb_score ~ . - aspect_ratio - title_year - facenumber_in_poster, 
##     data = movies_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6111 -0.4915  0.0772  0.6142  2.4800 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.845e+00  7.505e-02  64.554  < 2e-16 ***
## num_critic_for_reviews     1.543e-03  1.844e-04   8.367  < 2e-16 ***
## duration                   1.195e-02  6.747e-04  17.717  < 2e-16 ***
## director_facebook_likes    7.442e-06  4.895e-06   1.520 0.128492    
## actor_3_facebook_likes     7.342e-05  2.192e-05   3.350 0.000816 ***
## actor_1_facebook_likes     7.716e-05  1.334e-05   5.785 7.83e-09 ***
## gross                     -1.813e-09  2.752e-10  -6.587 5.10e-11 ***
## num_voted_users            4.016e-06  1.765e-07  22.758  < 2e-16 ***
## cast_total_facebook_likes -7.683e-05  1.329e-05  -5.780 8.09e-09 ***
## num_user_for_reviews      -5.564e-04  5.868e-05  -9.483  < 2e-16 ***
## budget                    -5.669e-11  6.341e-11  -0.894 0.371389    
## actor_2_facebook_likes     7.932e-05  1.403e-05   5.652 1.70e-08 ***
## movie_facebook_likes      -2.786e-06  9.793e-07  -2.844 0.004473 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8704 on 3788 degrees of freedom
## Multiple R-squared:  0.3249, Adjusted R-squared:  0.3227 
## F-statistic: 151.9 on 12 and 3788 DF,  p-value: < 2.2e-16
#From the summary, budget (p = 0.37) and director_facebook_likes (p = 0.13)
#are not significant, so they are dropped from the next model

#step(model)


model1 <- lm(imdb_score ~ num_critic_for_reviews + duration + actor_3_facebook_likes + actor_1_facebook_likes + gross + num_voted_users + cast_total_facebook_likes + num_user_for_reviews + 
    actor_2_facebook_likes + movie_facebook_likes, data = movies_numeric)

summary(model1)
## 
## Call:
## lm(formula = imdb_score ~ num_critic_for_reviews + duration + 
##     actor_3_facebook_likes + actor_1_facebook_likes + gross + 
##     num_voted_users + cast_total_facebook_likes + num_user_for_reviews + 
##     actor_2_facebook_likes + movie_facebook_likes, data = movies_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6128 -0.4937  0.0754  0.6112  2.4820 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.839e+00  7.481e-02  64.677  < 2e-16 ***
## num_critic_for_reviews     1.532e-03  1.839e-04   8.331  < 2e-16 ***
## duration                   1.202e-02  6.713e-04  17.906  < 2e-16 ***
## actor_3_facebook_likes     7.373e-05  2.192e-05   3.364 0.000776 ***
## actor_1_facebook_likes     7.718e-05  1.334e-05   5.786 7.79e-09 ***
## gross                     -1.857e-09  2.741e-10  -6.777 1.41e-11 ***
## num_voted_users            4.071e-06  1.731e-07  23.516  < 2e-16 ***
## cast_total_facebook_likes -7.681e-05  1.329e-05  -5.778 8.18e-09 ***
## num_user_for_reviews      -5.589e-04  5.866e-05  -9.528  < 2e-16 ***
## actor_2_facebook_likes     7.944e-05  1.404e-05   5.660 1.63e-08 ***
## movie_facebook_likes      -2.765e-06  9.789e-07  -2.825 0.004756 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8705 on 3790 degrees of freedom
## Multiple R-squared:  0.3243, Adjusted R-squared:  0.3225 
## F-statistic: 181.9 on 10 and 3790 DF,  p-value: < 2.2e-16
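
One way to check whether dropping predictors costs anything is a partial F-test between the nested models. The sketch below uses the built-in `mtcars` data as a stand-in, since the movie file is not bundled here; the choice of variables is purely illustrative.

```r
# mtcars ships with R and stands in for the movie data here.
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced_model <- lm(mpg ~ wt + hp, data = mtcars)

# Adjusted R-squared of each fit.
adj_r2 <- c(
  full = summary(full_model)$adj.r.squared,
  reduced = summary(reduced_model)$adj.r.squared
)
print(adj_r2)

# Partial F-test: a large p-value suggests the dropped terms add little.
comparison <- anova(reduced_model, full_model)
print(comparison)
```

Applied to the two movie models above, the same call would be `anova(model1, model)`, assuming both were fitted on the same rows.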

The adjusted R-squared is only about 0.32, so the linear model explains roughly a third of the variance in IMDB score and is not a good fit here.

We therefore decided to create bins for the class variable, i.e., IMDB score.

Scores in (0, 4] fall in Bin 1, (4, 5.5] in Bin 2, (5.5, 7] in Bin 3, (7, 8.5] in Bin 4, and (8.5, 10] in Bin 5.

We then apply different classification techniques.

movies_model <- read.csv("C:/Users/Supreeth/Downloads/movie_metadata.csv/movie_metadata.csv",header=TRUE)
movies_model <- na.omit(movies_model)

row.names(movies_model) <- 1:nrow(movies_model)

# creating bins
movies_model$class <- cut(movies_model$imdb_score, breaks = c(0, seq(4, 10, by = 1.5)), labels = 1:5)

# cut() already returns a factor; as.factor() is kept as an explicit safeguard
movies_model$class <- as.factor(movies_model$class)
levels(movies_model$class)
## [1] "1" "2" "3" "4" "5"
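
The binning relies on the default behavior of `cut()`, where each interval is closed on the right. A short sketch on made-up scores shows where the boundary values land and how `right = FALSE` would change that.

```r
# Made-up scores, chosen to sit on and around the bin boundaries.
scores <- c(1.6, 4.0, 4.1, 5.5, 7.0, 8.5, 8.6, 10)
breaks <- c(0, seq(4, 10, by = 1.5))  # 0, 4, 5.5, 7, 8.5, 10

# Default: intervals are (lower, upper], so a score of exactly 4 goes to bin 1.
bins <- cut(scores, breaks = breaks, labels = 1:5)
print(data.frame(score = scores, bin = bins))

# right = FALSE gives [lower, upper): 4 moves to bin 2, and 10 falls outside
# the last interval (NA) unless include.lowest = TRUE is also set.
bins_left <- cut(scores, breaks = breaks, labels = 1:5, right = FALSE)
print(data.frame(score = scores, bin = bins_left))
```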
movies_model$country <- as.factor(movies_model$country)
levels(movies_model$country)
##  [1] ""                     "Afghanistan"          "Argentina"           
##  [4] "Aruba"                "Australia"            "Bahamas"             
##  [7] "Belgium"              "Brazil"               "Bulgaria"            
## [10] "Cambodia"             "Cameroon"             "Canada"              
## [13] "Chile"                "China"                "Colombia"            
## [16] "Czech Republic"       "Denmark"              "Dominican Republic"  
## [19] "Egypt"                "Finland"              "France"              
## [22] "Georgia"              "Germany"              "Greece"              
## [25] "Hong Kong"            "Hungary"              "Iceland"             
## [28] "India"                "Indonesia"            "Iran"                
## [31] "Ireland"              "Israel"               "Italy"               
## [34] "Japan"                "Kenya"                "Kyrgyzstan"          
## [37] "Libya"                "Mexico"               "Netherlands"         
## [40] "New Line"             "New Zealand"          "Nigeria"             
## [43] "Norway"               "Official site"        "Pakistan"            
## [46] "Panama"               "Peru"                 "Philippines"         
## [49] "Poland"               "Romania"              "Russia"              
## [52] "Slovakia"             "Slovenia"             "South Africa"        
## [55] "South Korea"          "Soviet Union"         "Spain"               
## [58] "Sweden"               "Switzerland"          "Taiwan"              
## [61] "Thailand"             "Turkey"               "UK"                  
## [64] "United Arab Emirates" "USA"                  "West Germany"
movies_model$content_rating <- as.factor(movies_model$content_rating)
levels(movies_model$content_rating)
##  [1] ""          "Approved"  "G"         "GP"        "M"        
##  [6] "NC-17"     "Not Rated" "Passed"    "PG"        "PG-13"    
## [11] "R"         "TV-14"     "TV-G"      "TV-MA"     "TV-PG"    
## [16] "TV-Y"      "TV-Y7"     "Unrated"   "X"
movies_model$language <- as.factor(movies_model$language)
levels(movies_model$language)
##  [1] ""           "Aboriginal" "Arabic"     "Aramaic"    "Bosnian"   
##  [6] "Cantonese"  "Chinese"    "Czech"      "Danish"     "Dari"      
## [11] "Dutch"      "Dzongkha"   "English"    "Filipino"   "French"    
## [16] "German"     "Greek"      "Hebrew"     "Hindi"      "Hungarian" 
## [21] "Icelandic"  "Indonesian" "Italian"    "Japanese"   "Kannada"   
## [26] "Kazakh"     "Korean"     "Mandarin"   "Maya"       "Mongolian" 
## [31] "None"       "Norwegian"  "Panjabi"    "Persian"    "Polish"    
## [36] "Portuguese" "Romanian"   "Russian"    "Slovenian"  "Spanish"   
## [41] "Swahili"    "Swedish"    "Tamil"      "Telugu"     "Thai"      
## [46] "Urdu"       "Vietnamese" "Zulu"
match("director_name",names(movies_model))
## [1] 2
match("actor_2_name",names(movies_model))
## [1] 7
match("genres",names(movies_model))
## [1] 10
match("actor_1_name",names(movies_model))
## [1] 11
match("movie_title",names(movies_model))
## [1] 12
match("actor_3_name",names(movies_model))
## [1] 15
match("plot_keywords",names(movies_model))
## [1] 17
match("movie_imdb_link",names(movies_model))
## [1] 18
match("imdb_score",names(movies_model))
## [1] 26
match("aspect_ratio",names(movies_model))
## [1] 27
match("title_year",names(movies_model))
## [1] 24
movies1 <- movies_model[,-c(2,7,10,11,12,15,17,18,26,27,24)]
colnames(movies1)
##  [1] "color"                     "num_critic_for_reviews"   
##  [3] "duration"                  "director_facebook_likes"  
##  [5] "actor_3_facebook_likes"    "actor_1_facebook_likes"   
##  [7] "gross"                     "num_voted_users"          
##  [9] "cast_total_facebook_likes" "facenumber_in_poster"     
## [11] "num_user_for_reviews"      "language"                 
## [13] "country"                   "content_rating"           
## [15] "budget"                    "actor_2_facebook_likes"   
## [17] "movie_facebook_likes"      "class"
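
Dropping columns by position works but breaks silently if the column order changes. A sketch of dropping by name, on a made-up data frame, is shown below; the `drop_cols` vector is hypothetical. Removing `imdb_score` matters in particular, because `class` is derived from it and keeping it would leak the answer to the classifier.

```r
# Made-up data frame with a few of the column names used above.
df <- data.frame(
  director_name = c("x", "y"),
  duration = c(100, 120),
  imdb_score = c(7.1, 6.3),
  class = factor(c(4, 3)),
  stringsAsFactors = FALSE
)

# Names to remove; "not_a_column" is a deliberate typo for the check below.
drop_cols <- c("director_name", "imdb_score", "not_a_column")
kept <- df[, !(names(df) %in% drop_cols), drop = FALSE]
print(names(kept))

# setdiff() reports requested names that were not found, which catches typos.
missing_cols <- setdiff(drop_cols, names(df))
print(missing_cols)
```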

Logistic Regression

# splitting into train and test data
smp_size <- floor(0.75 * nrow(movies1))
set.seed(123)
train_ind <- sample(seq_len(nrow(movies1)), size = smp_size)
train <- movies1[train_ind, ]
test <- movies1[-train_ind, ]
colnames(train)
##  [1] "color"                     "num_critic_for_reviews"   
##  [3] "duration"                  "director_facebook_likes"  
##  [5] "actor_3_facebook_likes"    "actor_1_facebook_likes"   
##  [7] "gross"                     "num_voted_users"          
##  [9] "cast_total_facebook_likes" "facenumber_in_poster"     
## [11] "num_user_for_reviews"      "language"                 
## [13] "country"                   "content_rating"           
## [15] "budget"                    "actor_2_facebook_likes"   
## [17] "movie_facebook_likes"      "class"
str(train)
## 'data.frame':    2850 obs. of  18 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ num_critic_for_reviews   : int  21 150 634 425 79 252 72 55 34 42 ...
##  $ duration                 : int  117 124 95 92 111 117 101 117 94 101 ...
##  $ director_facebook_likes  : int  71 78 246 22 17 14000 545 133 36 30 ...
##  $ actor_3_facebook_likes   : int  547 4 751 159 652 358 545 249 507 223 ...
##  $ actor_1_facebook_likes   : int  723 6 26000 648 13000 535 1000 687 13000 1000 ...
##  $ gross                    : int  70100000 439162 42043633 25138292 32940507 52792307 70360285 20966644 13801755 1075288 ...
##  $ num_voted_users          : int  20183 106160 277172 66483 54316 12572 81783 29610 23928 7772 ...
##  $ cast_total_facebook_likes: int  3841 28 29551 1122 16537 1950 3816 1665 15183 2228 ...
##  $ facenumber_in_poster     : int  1 0 0 0 1 0 0 0 1 3 ...
##  $ num_user_for_reviews     : int  41 430 986 452 127 106 283 94 100 55 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 24 13 13 13 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 34 65 65 65 63 65 63 65 65 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 11 11 11 11 11 9 10 18 9 11 ...
##  $ budget                   : num  4.0e+07 1.1e+09 3.0e+07 3.5e+06 6.0e+07 1.4e+08 1.8e+07 3.0e+06 2.0e+07 2.2e+07 ...
##  $ actor_2_facebook_likes   : int  558 5 821 191 969 400 1000 443 828 854 ...
##  $ movie_facebook_likes     : int  0 0 66000 43000 0 27000 0 0 0 145 ...
##  $ class                    : Factor w/ 5 levels "1","2","3","4",..: 2 4 3 3 3 3 3 4 3 3 ...
dim(train)
## [1] 2850   18
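# Multinomial logistic regression on all predictors except facenumber_in_poster.
# Note: the fit below stops at the default limit of 100 iterations without converging;
# a larger maxit may be needed for a converged fit.
mymodel <- multinom(class ~ .-facenumber_in_poster, data = train)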
## # weights:  730 (580 variable)
## initial  value 4586.898050 
## iter  10 value 3693.333075
## iter  20 value 3429.352710
## iter  30 value 3232.303764
## iter  40 value 3027.547575
## iter  50 value 2938.229585
## iter  60 value 2772.124553
## iter  70 value 2656.795452
## iter  80 value 2587.257081
## iter  90 value 2554.710336
## iter 100 value 2518.324855
## final  value 2518.324855 
## stopped after 100 iterations
predict <- predict(mymodel,test)

cm <- table(predict, test$class)
cm
##        
## predict   1   2   3   4   5
##       1   0   1   0   2   0
##       2   2   8   8   2   0
##       3  25 151 436 163   0
##       4   0   2  31 112   6
##       5   0   0   0   1   1
accuracy <- sum(test$class == predict)/length(test$class)
accuracy
## [1] 0.5856993
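
As a cross-check, the accuracy above can be recovered from the confusion matrix itself, together with per-class recall and a majority-class baseline. The sketch below copies the printed counts into a matrix by hand; the names cm_logit, recall and baseline are illustrative additions, not part of the original analysis.

```r
# Confusion matrix copied from the multinom output above
# (rows = predicted class, columns = actual class).
cm_logit <- matrix(
  c( 0,   1,   0,   2, 0,
     2,   8,   8,   2, 0,
    25, 151, 436, 163, 0,
     0,   2,  31, 112, 6,
     0,   0,   0,   1, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = as.character(1:5), actual = as.character(1:5))
)

# Accuracy: correctly classified cases (the diagonal) over all test cases.
accuracy_logit <- sum(diag(cm_logit)) / sum(cm_logit)

# Recall per class: correct predictions divided by the actual count of that class.
recall <- diag(cm_logit) / colSums(cm_logit)

# Baseline: accuracy of always predicting the most frequent actual class.
baseline <- max(colSums(cm_logit)) / sum(cm_logit)

print(round(accuracy_logit, 4))
print(round(recall, 3))
print(round(baseline, 4))
```

On these counts the baseline is about 0.50 (475 of the 951 test movies are in class 3), so the accuracy of about 0.586 is a modest gain. The recall values show where it comes from: roughly 0.92 for class 3 but only about 0.05 for class 2.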

SVM CLASSIFICATION

# Splitting into train (75%) and test (25%) data
smp_size <- floor(0.75 * nrow(movies1))
set.seed(123)
train_ind <- sample(seq_len(nrow(movies1)), size = smp_size)
train <- movies1[train_ind, ]
test <- movies1[-train_ind, ]

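# SVM model on movies1, predicting the imdb_score class
# Caution: because the response is written as train$class, the `.` on the right-hand side
# may also expand to the class column itself; svm(class ~ ., data = train) avoids this.
model <- svm(train$class ~ ., train)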
res <- predict(model,newdata = test)

# Confusion matrix
table(res,test$class)
##    
## res   1   2   3   4   5
##   1   0   0   0   0   0
##   2   0   0   0   0   0
##   3  27 160 458 169   0
##   4   0   2  17 111   6
##   5   0   0   0   0   1
accuracy <- sum(test$class == res)/length(test$class)
accuracy
## [1] 0.5993691
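
One detail of the svm() call above is worth checking: how the `.` in a formula expands when the response is written as train$class rather than class. The sketch below uses base R's terms() on a made-up data frame (toy, leaky_terms and clean_terms are illustrative names) to list the predictors each formula would use. It is an illustration of formula expansion only, not a re-run of the model.

```r
# Toy data frame with two predictors and a class label (hypothetical values).
toy <- data.frame(
  x1    = c(1, 2, 3, 4),
  x2    = c(2, 1, 4, 3),
  class = factor(c("a", "b", "a", "b"))
)

# Response written as toy$class: the dot expands over all columns of toy,
# because no column name matches the expression on the left-hand side.
leaky_terms <- attr(terms(toy$class ~ ., data = toy), "term.labels")

# Response written as a bare column name: the dot excludes that column.
clean_terms <- attr(terms(class ~ ., data = toy), "term.labels")

print(leaky_terms)
print(clean_terms)
```

If the first list contains class, the response is also being offered to the model as a predictor, so writing the formula as class ~ . with data = train is the safer form.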

IMPLEMENTING RANDOM FOREST

library("randomForest")
## Warning: package 'randomForest' was built under R version 3.2.5
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gdata':
## 
##     combine
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
#colnames(movies_model)
colnames(train)
##  [1] "color"                     "num_critic_for_reviews"   
##  [3] "duration"                  "director_facebook_likes"  
##  [5] "actor_3_facebook_likes"    "actor_1_facebook_likes"   
##  [7] "gross"                     "num_voted_users"          
##  [9] "cast_total_facebook_likes" "facenumber_in_poster"     
## [11] "num_user_for_reviews"      "language"                 
## [13] "country"                   "content_rating"           
## [15] "budget"                    "actor_2_facebook_likes"   
## [17] "movie_facebook_likes"      "class"
str(train)
## 'data.frame':    2850 obs. of  18 variables:
##  $ color                    : Factor w/ 3 levels ""," Black and White",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ num_critic_for_reviews   : int  21 150 634 425 79 252 72 55 34 42 ...
##  $ duration                 : int  117 124 95 92 111 117 101 117 94 101 ...
##  $ director_facebook_likes  : int  71 78 246 22 17 14000 545 133 36 30 ...
##  $ actor_3_facebook_likes   : int  547 4 751 159 652 358 545 249 507 223 ...
##  $ actor_1_facebook_likes   : int  723 6 26000 648 13000 535 1000 687 13000 1000 ...
##  $ gross                    : int  70100000 439162 42043633 25138292 32940507 52792307 70360285 20966644 13801755 1075288 ...
##  $ num_voted_users          : int  20183 106160 277172 66483 54316 12572 81783 29610 23928 7772 ...
##  $ cast_total_facebook_likes: int  3841 28 29551 1122 16537 1950 3816 1665 15183 2228 ...
##  $ facenumber_in_poster     : int  1 0 0 0 1 0 0 0 1 3 ...
##  $ num_user_for_reviews     : int  41 430 986 452 127 106 283 94 100 55 ...
##  $ language                 : Factor w/ 48 levels "","Aboriginal",..: 13 24 13 13 13 13 13 13 13 13 ...
##  $ country                  : Factor w/ 66 levels "","Afghanistan",..: 65 34 65 65 65 63 65 63 65 65 ...
##  $ content_rating           : Factor w/ 19 levels "","Approved",..: 11 11 11 11 11 9 10 18 9 11 ...
##  $ budget                   : num  4.0e+07 1.1e+09 3.0e+07 3.5e+06 6.0e+07 1.4e+08 1.8e+07 3.0e+06 2.0e+07 2.2e+07 ...
##  $ actor_2_facebook_likes   : int  558 5 821 191 969 400 1000 443 828 854 ...
##  $ movie_facebook_likes     : int  0 0 66000 43000 0 27000 0 0 0 145 ...
##  $ class                    : Factor w/ 5 levels "1","2","3","4",..: 2 4 3 3 3 3 3 4 3 3 ...
model_randomforest  <- randomForest(class ~ . -country, data = train)
summary(model_randomforest)
##                 Length Class  Mode     
## call                3  -none- call     
## type                1  -none- character
## predicted        2850  factor numeric  
## err.rate         3000  -none- numeric  
## confusion          30  -none- numeric  
## votes           14250  matrix numeric  
## oob.times        2850  -none- numeric  
## classes             5  -none- character
## importance         16  -none- numeric  
## importanceSD        0  -none- NULL     
## localImportance     0  -none- NULL     
## proximity           0  -none- NULL     
## ntree               1  -none- numeric  
## mtry                1  -none- numeric  
## forest             14  -none- list     
## y                2850  factor numeric  
## test                0  -none- NULL     
## inbag               0  -none- NULL     
## terms               3  terms  call
pred <- predict(model_randomforest, test)

confusion_matrix <- table(pred , test$class)
confusion_matrix
##     
## pred   1   2   3   4   5
##    1   1   2   0   0   0
##    2   1  28   5   1   0
##    3  25 121 373  69   0
##    4   0  11  97 210   4
##    5   0   0   0   0   3
accuracy <- sum(test$class == pred)/length(test$class)
accuracy
## [1] 0.6466877
varImpPlot(model_randomforest)

# Refit using only the most important predictors from the variable-importance plot
colnames(train)
##  [1] "color"                     "num_critic_for_reviews"   
##  [3] "duration"                  "director_facebook_likes"  
##  [5] "actor_3_facebook_likes"    "actor_1_facebook_likes"   
##  [7] "gross"                     "num_voted_users"          
##  [9] "cast_total_facebook_likes" "facenumber_in_poster"     
## [11] "num_user_for_reviews"      "language"                 
## [13] "country"                   "content_rating"           
## [15] "budget"                    "actor_2_facebook_likes"   
## [17] "movie_facebook_likes"      "class"
model_rm <- randomForest(class ~ num_voted_users + budget + gross + num_critic_for_reviews + actor_3_facebook_likes + cast_total_facebook_likes, data = train)

pred_ran_n <- predict(model_rm, test)

confusion_matrix <- table(pred_ran_n, test$class)
confusion_matrix
##           
## pred_ran_n   1   2   3   4   5
##          1   1   0   0   0   0
##          2   2  29  17   3   0
##          3  23 131 397 116   0
##          4   1   2  61 161   4
##          5   0   0   0   0   3
accuracy <- sum(test$class == pred_ran_n)/length(test$class)
accuracy
## [1] 0.6214511

PERFORMING ASSOCIATION MINING

Identifying the Best Genres using Item Frequency and Support Values

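# Keep only highly rated movies (imdb_score >= 7); note that this overwrites movies_model
movies_model <- movies_model[movies_model$imdb_score >= 7, ]
# Treat each distinct score as a group label
movies_model$imdb_score <- factor(movies_model$imdb_score)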


bestGenre <- split(x=movies_model[,"genres"],f=movies_model$imdb_score)

bestGenre <- lapply(bestGenre,unique)
bestGenre <- as(bestGenre,"transactions")
#itemFrequency(bestGenre)

itemFrequencyPlot(bestGenre,support=.4,cex.names=1.5)
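
To make the transaction construction above concrete, here is a minimal base-R sketch on a made-up data frame (toy values, not taken from the dataset; the names toy, groups and freq are illustrative). Each distinct imdb_score becomes one "transaction" holding the unique genres seen at that score, and an item's frequency is the share of transactions that contain it. The support argument of itemFrequencyPlot filters on that share. The same construction is reused for directors and actors in the sections that follow.

```r
# Toy stand-in for the filtered movies_model (hypothetical values).
toy <- data.frame(
  imdb_score = c(7.0, 7.0, 7.5, 7.5, 8.0),
  genres     = c("Drama", "Comedy", "Drama", "Drama", "Action"),
  stringsAsFactors = FALSE
)

# One group per score, keeping each genre at most once per group.
groups <- lapply(split(toy$genres, toy$imdb_score), unique)

# Item frequency: fraction of groups (transactions) that contain each genre.
items <- sort(unique(unlist(groups)))
freq <- sapply(items, function(it) {
  mean(sapply(groups, function(g) it %in% g))
})

print(freq)

# Items that would pass a support threshold of 0.4.
print(names(freq)[freq >= 0.4])
```

Because support here counts score levels rather than movies, an item that appears once at many different scores ranks higher than one that appears many times at a single score.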

IDENTIFYING BEST DIRECTORS

bestdirector <- split(x=movies_model[,"director_name"],f=movies_model$imdb_score)
bestdirector <- lapply(bestdirector,unique)
bestdirector <- as(bestdirector,"transactions")

#itemFrequency(bestdirector)

itemFrequencyPlot(bestdirector,support=.25,cex.names=1.5)

IDENTIFYING BEST LEAD ACTORS

best_actors1 <- split(x=movies_model[,"actor_1_name"],f=movies_model$imdb_score)
best_actors1 <- lapply(best_actors1,unique)
best_actors1 <- as(best_actors1,"transactions")

#itemFrequency(best_actors1)

itemFrequencyPlot(best_actors1,support=.3,cex.names=1.5)

IDENTIFYING BEST SECONDARY ACTORS

best_actors2 <- split(x=movies_model[,"actor_2_name"],f=movies_model$imdb_score)
best_actors2 <- lapply(best_actors2,unique)
best_actors2 <- as(best_actors2,"transactions")

#itemFrequency(best_actors2)

itemFrequencyPlot(best_actors2,support=.15,cex.names=1.5)

DATA VISUALISATION

Using visuals, we are going to answer the following questions

1. Are Facebook likes an indicator of the imdb_score?

From the graph, we see no pattern that would indicate a correlation between a director's Facebook likes and imdb_score.

fbbudget <- movies_model %>% select(cast_total_facebook_likes,budget,movie_title,content_rating)
p <- plot_ly(fbbudget, x = ~cast_total_facebook_likes, y = ~budget,
             color = ~content_rating , mode = "markers",text= ~movie_title)
p
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plot.ly/r/reference/#scatter
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

We did not see much correlation between cast_total_facebook_likes and budget. For example, The Legend of Ron Burgundy has a cast with very high Facebook likes but a low budget. In contrast, the cast of “The Host” has fewer Facebook likes, but the movie has a high budget.

imdb_dist <- ggplot(movies_model, aes(x = imdb_score))+ geom_histogram(stat = "bin", fill = "blue")
imdb_dist
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of imdb_score appears to be approximately normal, with a mean of about 6. It has a long tail on the left: scores on the low side stretch down to below 2.5, farther from the mean than the scores on the right (>7.5).

Plotting IMDB Score with variables - Gross Income, Budget and Duration to understand the Correlation

## Scatter Plot for Gross Income and imdb_score
ggplot(movies_model, aes(x = gross, y = imdb_score))+ geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'

## Scatter Plot for Budget and imdb_score
ggplot(movies_model, aes(x = budget, y = imdb_score))+ geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam'

## Duration vs imdb score
dur_imdbscore <- movies_model %>% select(duration, gross, imdb_score) %>% ggplot(aes(duration,imdb_score, color=gross))+geom_point()
dur_imdbscore

We observe that the duration of the movie has a positive correlation with the imdb score.
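
The visual impression can also be checked numerically with cor(). This is a minimal sketch on made-up vectors (hypothetical values, not the dataset). On the real data the equivalent call would be cor(movies_model$duration, movies_model$imdb_score, use = "complete.obs"), assuming imdb_score is still numeric at that point.

```r
# Hypothetical durations (minutes) and scores; one duration is missing.
duration   <- c(90, 100, NA, 110, 120, 150)
imdb_score <- c(5.5, 6.0, 6.5, 6.2, 7.0, 7.8)

# Pearson correlation measures linear association; rows with NA are dropped.
pearson <- cor(duration, imdb_score, use = "complete.obs")

# Spearman correlation uses ranks, so it is less sensitive to outliers.
spearman <- cor(duration, imdb_score, use = "complete.obs", method = "spearman")

print(pearson)
print(spearman)
```

Reporting a coefficient alongside the scatter plot makes the strength of the relationship explicit, since a positive trend can still be weak.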

LIMITATIONS

1. Budgets reported in foreign currencies skew the budget data for foreign films, particularly for currencies with extreme exchange rates against the USD.

2. We considered only a limited number of attributes when building our models. Adding attributes from other sources may improve the models and the analysis. External factors may also be involved.

3. The distribution of people watching movies is different for different regions.

CONCLUSION

1. Duration is positively associated with the IMDB rating: longer movies tend to have higher ratings.

2. Budget is also an important predictor of the rating class.

3. The number of user reviews seems to be higher for higher-rated movies.

4. The number of faces in the poster has no effect on the IMDB rating.